# Evaluation of LLM-Integrated System for Training Child Helpline Counsellors
Authors: Adarsh Denga
Contact: adenga@tudelft.nl

This dataset was created as part of an evaluation study of a virtual LLM-integrated system for training child helpline counsellors. The design of the study was pre-registered with the Open Science Framework (OSF) registries and is publicly available at https://osf.io/6g7e2. The dataset contains participants' survey responses for our five measures (human-like behaviour, natural behaviour, engagement, attitude and overall performance) for both the LLM-integrated and rule-based systems. The R Markdown file Experiment Analysis.Rmd contains data-specific information on how to use this dataset. Participants were recruited through the online platform Prolific, and the data was collected through an online survey hosted on Qualtrics. The questionnaires for our measures are from the Artificial Social Agent Questionnaire, available at https://ii.tudelft.nl/evalquest/web/node/1.

All statistical analyses were done using R software (version 4.4.2). This work is licensed under CC BY 4.0.

The original data files obtained from the experiment contain confidential data, so only a cleaned and anonymised version is published.

## Files
- Experiment Analysis.Rmd: The R markdown file explaining the data and outlining the statistical tests we perform as part of this evaluation study.
- Experiment-Analysis.pdf: The knit PDF file from Experiment Analysis.Rmd
- readme.md: This file, which outlines all of the files in the data package and their purpose.

CSV Files:
- data_raw_llm.csv: The raw experiment data from our participants after using the LLM-based system.
- data_raw_rbs.csv: The raw experiment data from our participants after using the rule-based system.
- constructs_averaged_llm.csv: Participants' data from the LLM-based system split into the averages for our five measures.
- constructs_averaged_rbs.csv: Participants' data from the rule-based system split into the averages for our five measures. 
- qual.csv: Participants' qualitative responses from using both the LLM and rule-based systems.

R Scripts:
- binomialtest.R: The R script used to perform a binomial test to determine if the preference between the two systems is statistically significant.
- cohenskappa.R: The R script used to obtain the degree of agreement between our two coders for the codes in the qualitative feedback for the thematic analysis.
- poweranalysis.R: The R script used to determine the required sample size given the target power, type of test, and alpha error probability.
- ttest.R: The R script used to perform the paired-samples t-test for our five measures between the LLM and rule-based systems.

Python Scripts:
- construct_split.py: The Python script used to translate the raw experiment data from data_raw_llm.csv and data_raw_rbs.csv to constructs_averaged_llm.csv and constructs_averaged_rbs.csv respectively.

## File Structure

### data_raw_(llm/rbs).csv:
Both of the raw data files contain the following columns:
- PROLIFIC_PID: Anonymised participant ID
- Q1-Q34: Feedback for the 34 quantitative questions from the questionnaire

The columns Q1-Q34 correspond to our measures as follows:
- Human-Like Behaviour: Q2, Q25-Q28
- Natural Behaviour: Q4, Q29-Q30
- Engagement: Q13, Q31-Q32
- Attitude: Q19, Q33-Q34
- Overall Performance: Q1-Q24
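
The mapping above can be sketched in Python as follows. This is a minimal sketch and not the contents of construct_split.py: the function name `average_constructs` is hypothetical, and the sketch assumes every Q column holds a numeric response that needs no reverse-scoring or other preprocessing before averaging. The output keys follow the column names of the averaged files.

```python
import csv
from statistics import mean

# Question-to-measure mapping as listed in this readme.
CONSTRUCTS = {
    "OVERALL": [f"Q{i}" for i in range(1, 25)],            # Q1-Q24
    "BELIEVABILITY1": ["Q2", "Q25", "Q26", "Q27", "Q28"],  # human-like behaviour
    "BELIEVABILITY2": ["Q4", "Q29", "Q30"],                # natural behaviour
    "Engagement": ["Q13", "Q31", "Q32"],
    "Attitude": ["Q19", "Q33", "Q34"],
}


def average_constructs(row):
    """Return one averaged score per measure for a single participant row.

    Assumption: responses are numeric strings and are averaged as-is.
    """
    scores = {"PROLIFIC_PID": row["PROLIFIC_PID"]}
    for name, questions in CONSTRUCTS.items():
        scores[name] = mean(float(row[q]) for q in questions)
    return scores


def average_file(path):
    """Apply average_constructs to every row of a raw data file."""
    with open(path, newline="") as handle:
        return [average_constructs(row) for row in csv.DictReader(handle)]
```

For example, `average_file("data_raw_llm.csv")` would return one dictionary per participant, which can then be compared against the rows of constructs_averaged_llm.csv.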

### constructs_averaged_(llm/rbs).csv:
Both of the averaged data files contain the following columns:
- PROLIFIC_PID: Anonymised participant ID
- OVERALL: Overall performance score, averaged from the corresponding columns described above
- BELIEVABILITY1: Human-like behaviour score, averaged from the corresponding columns described above
- BELIEVABILITY2: Natural behaviour score, averaged from the corresponding columns described above
- Engagement: Engagement score, averaged from the corresponding columns described above
- Attitude: Attitude score, averaged from the corresponding columns described above
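
As an illustration of how these columns can be used, the sketch below pairs the two averaged files on PROLIFIC_PID and computes a paired-samples t statistic for one measure. It is a hedged Python sketch, not a translation of ttest.R: the helper names `load_scores` and `paired_t` are hypothetical, and it assumes each participant appears at most once per file. It returns only the t statistic and degrees of freedom; the p-value still has to be looked up in a t distribution.

```python
import csv
from math import sqrt
from statistics import mean, stdev


def load_scores(path, column):
    """Map each PROLIFIC_PID to its score in the given column."""
    with open(path, newline="") as handle:
        return {
            row["PROLIFIC_PID"]: float(row[column])
            for row in csv.DictReader(handle)
        }


def paired_t(llm, rbs):
    """Return (t statistic, degrees of freedom) for a paired comparison.

    Only participants present in both mappings are used.
    """
    shared = sorted(set(llm) & set(rbs))
    diffs = [llm[pid] - rbs[pid] for pid in shared]
    n = len(diffs)
    # t = mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

A typical call would be `paired_t(load_scores("constructs_averaged_llm.csv", "Engagement"), load_scores("constructs_averaged_rbs.csv", "Engagement"))`.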

### qual.csv
The qualitative feedback file contains the following columns:
- PROLIFIC_PID: Anonymised participant ID
- CONDITION_ORDER: Flag indicating the order in which the participant interacted with the two conditions (LLM-based and rule-based)
- LLM: Participants' qualitative feedback from using the LLM-based system
- RBS: Participants' qualitative feedback from using the rule-based system

## Thematic Analysis Codes
The file cohenskappa.R contains the results from our two coders for the thematic analysis, given as numbers from 1 to 12. The corresponding codes for each number are:
1. Human-Like Responses
2. Emotional Engagement
3. Positive Experience
4. Boring Responses
5. Abrupt Ending
6. Unnatural Responses
7. Slow Responses
8. Personality
9. Depth of Conversation
10. Five-Phase Model
11. Scripted Responses
12. Technical Issues
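
The numbering above can be turned into a lookup table, and the agreement between two coders can be computed with Cohen's kappa. The following is a minimal Python sketch, not the implementation in cohenskappa.R: the function name `cohens_kappa` is hypothetical, and it assumes each coder assigned exactly one code per item, in the same item order.

```python
from collections import Counter

# Code numbers and labels as listed in this readme.
CODES = {
    1: "Human-Like Responses",
    2: "Emotional Engagement",
    3: "Positive Experience",
    4: "Boring Responses",
    5: "Abrupt Ending",
    6: "Unnatural Responses",
    7: "Slow Responses",
    8: "Personality",
    9: "Depth of Conversation",
    10: "Five-Phase Model",
    11: "Scripted Responses",
    12: "Technical Issues",
}


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equally long lists of code numbers."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must rate the same non-empty set of items")
    n = len(coder_a)
    # Observed agreement: share of items given the same code by both coders.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal proportions per code.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    expected = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if expected == 1:
        # Both coders used one identical code throughout.
        return 1.0
    return (observed - expected) / (1 - expected)
```

`CODES[number]` converts a number from cohenskappa.R back to its label when inspecting disagreements.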